16 research outputs found

    Statistical methods for analyzing sequencing data with applications in modern biomedical analysis and personalized medicine

    There has been tremendous advancement in sequencing technologies; the rate at which sequencing data can be generated has increased multifold while the cost of sequencing continues to fall. Sequencing data provide novel insights into the ecological environment of microbes as well as human health and disease status, but they challenge investigators with a variety of computational issues. This thesis focuses on three common problems in the analysis of high-throughput data. The goals of the first project are to (1) develop a statistical framework and a complete software pipeline for metagenomics that identifies microbes to the strain level, thereby facilitating personalized drug treatment that targets the strain; and (2) estimate the relative content of microbes in a sample as accurately and as quickly as possible. The second project focuses on the analysis of microbiome variation across multiple samples. Studying the variation of microbiomes under different conditions within an organism or environment is key to diagnosing diseases and providing personalized treatments. The goals are to (1) identify various statistical diversity measures; (2) develop confidence regions for the relative abundance estimates; (3) perform multi-dimensional and differential expression analysis; and (4) develop a complete pipeline for multi-sample microbiome analysis. The third project focuses on batch effect analysis. When analyzing high-dimensional data, non-biological experimental variation, or "batch effects", confounds the true associations between the conditions of interest and the outcome variable. Batch effects exist even after normalization. Hence, unless the batch effects are identified and corrected, any downstream analysis is likely to be error-prone and may yield false-positive results. The goals are to (1) analyze the effect of correlation in batch-adjusted data and develop new techniques to account for that correlation in a two-step hypothesis testing approach; and (2) develop a software pipeline that identifies whether batch effects are present in the data and adjusts for them in a suitable way. In summary, we developed software pipelines called PathoScope, PathoStat and BatchQC as part of these projects and validated our techniques using simulated and real data sets.

    Integrating microbial and host transcriptomics to characterize asthma-associated microbial communities.

    BACKGROUND: The relationships between infections in early life and asthma are not completely understood. Likewise, the clinical relevance of microbial communities present in the respiratory tract is only partially known. A number of microbiome studies analyzing respiratory tract samples have found increased proportions of gamma-Proteobacteria including Haemophilus influenzae, Moraxella catarrhalis, and Firmicutes such as Streptococcus pneumoniae. The aim of this study was to present a new approach that combines RNA microbial identification with host gene expression to characterize and validate metagenomic taxonomic profiling in individuals with asthma. METHODS: Using whole metagenomic shotgun RNA sequencing, we characterized and compared the microbial communities of children and adolescents with asthma and of controls. The resulting data were analyzed by partitioning human and microbial reads. Microbial reads were then used to characterize the microbial diversity of each patient and potential differences between asthmatic and healthy groups. Human reads were used to assess the expression of known genes involved in the host immune response to specific pathogens and to detect potential differences between those with asthma and controls. RESULTS: Microbial communities in the nasal cavities of children differed significantly between asthmatics and controls. After read count normalization, some bacterial species were significantly overrepresented in asthma patients (Wald test, p-value < 0.05), including Escherichia coli and Psychrobacter. Among these, Moraxella catarrhalis exhibited ~14-fold overabundance in asthmatics versus controls. Differential host gene expression analysis confirms that the presence of Moraxella catarrhalis is associated with a specific M. catarrhalis core gene signature expressed by the host. CONCLUSIONS: For the first time, we show the power of combining RNA taxonomic profiling and host gene expression signatures for microbial identification. Our approach not only identifies microbes from metagenomic data, but also adds support to these inferences by determining if the host is mounting a response against specific infectious agents. In particular, we show that M. catarrhalis is abundant in asthma patients but not in controls, and that its presence is associated with a specific host gene expression signature.

    PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples.

    BACKGROUND: Recent innovations in sequencing technologies have provided researchers with the ability to rapidly characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These approaches are producing a wealth of information that is providing novel insights into the microbial ecology of the environment and human health. However, these sequencing-based approaches produce large and complex datasets that require efficient and sensitive computational analysis workflows. Many tools for analyzing metagenomic sequencing data have recently emerged; however, these approaches often suffer from issues of specificity and efficiency, and typically do not include a complete metagenomic analysis framework. RESULTS: We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps, including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data from the 2011 outbreak of Shiga-toxigenic Escherichia coli O104:H4. CONCLUSIONS: The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download at: http://sourceforge.net/projects/pathoscope/

    Inhibition of colony stimulating factor 1 receptor corrects maternal inflammation-induced microglial and synaptic dysfunction and behavioral abnormalities

    Maternal immune activation (MIA) disrupts the central innate immune system during a critical neurodevelopmental period. Microglia are the primary innate immune cells in the brain, although their direct influence on the MIA phenotype is largely unknown. Here we show that MIA alters microglial gene expression with upregulation of cellular protrusion/neuritogenic pathways, concurrently causing repetitive behavior, social deficits, and synaptic dysfunction in layer V intrinsically bursting pyramidal neurons in the prefrontal cortex of mice. MIA increases plastic dendritic spines of the intrinsically bursting neurons and their interaction with hyper-ramified microglia. Treating MIA offspring with colony stimulating factor 1 receptor inhibitors induces depletion and repopulation of microglia, and corrects protein expression of the newly identified MIA-associated neuritogenic molecules in microglia, which coalesces with correction of MIA-associated synaptic, neurophysiological, and behavioral abnormalities. Our study demonstrates that maternal immune insults perturb microglial phenotypes and influence neuronal functions throughout adulthood, and reveals a potent effect of colony stimulating factor 1 receptor inhibitors on the correction of MIA-associated microglial, synaptic, and neurobehavioral dysfunctions.

    Alternative empirical Bayes models for adjusting for batch effects in genomic studies

    BACKGROUND: Combining genomic data sets from multiple studies is advantageous to increase statistical power in studies where logistical considerations restrict sample size or require the sequential generation of data. However, significant technical heterogeneity is commonly observed across multiple batches of data that are generated from different processing or reagent batches, experimenters, protocols, or profiling platforms. These so-called batch effects often confound true biological relationships in the data, reducing the power benefits of combining multiple batches, and may even lead to spurious results in some combined studies. Therefore there is significant need for effective methods and software tools that account for batch effects in high-throughput genomic studies. RESULTS: Here we contribute multiple methods and software tools for improved combination and analysis of data from multiple batches. In particular, we provide batch effect solutions for cases where the severity of the batch effects is not extreme, and for cases where one high-quality batch can serve as a reference, such as the training set in a biomarker study. We illustrate our approaches and software in both simulated and real data scenarios. CONCLUSIONS: We demonstrate the value of these new contributions compared to currently established approaches in the specified batch correction situations.
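The reference-batch idea mentioned in this abstract can be illustrated with a deliberately simple location-scale adjustment. This is only a sketch under stated assumptions: it is not the empirical Bayes method the paper describes, the function name `align_to_reference` is hypothetical, and it handles one gene at a time while ignoring biological covariates (so it would also remove real differences if conditions are unbalanced across batches).

```python
from statistics import mean, stdev

def align_to_reference(reference, other):
    """Rescale one gene's values in `other` so their mean and standard
    deviation match those of the `reference` batch (hypothetical helper)."""
    if len(reference) < 2 or len(other) < 2:
        raise ValueError("each batch needs at least two samples")
    ref_mu, ref_sd = mean(reference), stdev(reference)
    mu, sd = mean(other), stdev(other)
    if sd == 0:
        # A constant batch carries no spread to rescale; map it to the reference mean.
        return [ref_mu for _ in other]
    # Standardize within the batch, then map onto the reference location and scale.
    return [ref_mu + (x - mu) * ref_sd / sd for x in other]

# Toy values: the reference batch is left untouched, the second batch is shifted and scaled.
reference_batch = [1.0, 2.0, 3.0]
second_batch = [10.0, 20.0, 30.0]
adjusted = align_to_reference(reference_batch, second_batch)
```

Leaving the reference batch unchanged is the point of the design: a model trained on that batch (for example, a biomarker training set) stays valid, and only incoming batches are moved toward it.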

    Additional file 2 of Alternative empirical Bayes models for adjusting for batch effects in genomic studies

    Mean and variance of gene expression distributions estimated from the EGFR signature and the TCGA breast cancer patient datasets. In TCGA, we used proteomics data of the patients and binned the EGFR protein expression into 6 gradually increasing levels, partitioning all patients into 6 equal-sized groups. Means and variances are estimated within each group. Up- and down-regulated genes are both EGFR signature genes derived by ASSIGN. The design and parameters for our simulation studies resemble the real estimates in these tables. Batch 1 represents the EGFR signature dataset with small gene variances and a clear separation between the two condition groups in the expression of up-regulated genes. Batch 2 resembles the TCGA patient data with much larger variances than Batch 1. (XLSX 10 kb)
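The binning step described above (sort patients by protein level, split into 6 equal-sized groups, estimate mean and variance per group) might look roughly like the following. The function name, the toy numbers, and the use of the sample variance are assumptions for illustration; the file description does not say how the estimates were actually computed.

```python
from statistics import mean, variance

def binned_mean_variance(protein_levels, gene_expression, n_bins=6):
    """Sort samples by protein level, split them into n_bins near-equal groups,
    and return a (mean, sample variance) pair of gene expression per group."""
    if len(protein_levels) != len(gene_expression):
        raise ValueError("inputs must have the same length")
    n = len(protein_levels)
    if n < 2 * n_bins:
        raise ValueError("need at least two samples per bin")
    # Order sample indices from lowest to highest protein level.
    order = sorted(range(n), key=lambda i: protein_levels[i])
    stats = []
    for b in range(n_bins):
        # Integer arithmetic spreads any remainder across the bins.
        start, stop = b * n // n_bins, (b + 1) * n // n_bins
        values = [gene_expression[i] for i in order[start:stop]]
        stats.append((mean(values), variance(values)))
    return stats

# Hypothetical toy data: 12 samples, so each of the 6 bins holds 2 samples.
protein = [0.1, 0.9, 0.5, 0.3, 0.7, 1.1, 0.2, 1.0, 0.6, 0.4, 0.8, 1.2]
expr = [1.0, 5.0, 3.0, 2.0, 4.0, 6.0, 1.2, 5.4, 3.2, 2.2, 4.2, 6.2]
result = binned_mean_variance(protein, expr)
```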

    Additional file 3 of Alternative empirical Bayes models for adjusting for batch effects in genomic studies

    Comparison of P-values for applying robust and non-robust F tests on the four experimental datasets. (XLSX 12 kb)

    Pathoscope: species identification and strain attribution with unassembled sequencing data.

    Emerging next-generation sequencing technologies have revolutionized the collection of genomic data for applications in bioforensics, biosurveillance, and for use in clinical settings. However, to make the most of these new data, new methodology needs to be developed that can accommodate large volumes of genetic data in a computationally efficient manner. We present a statistical framework to analyze raw next-generation sequence reads from purified or mixed environmental or targeted infected tissue samples for rapid species identification and strain attribution against a robust database of known biological agents. Our method, Pathoscope, capitalizes on a Bayesian statistical framework that accommodates information on sequence quality and mapping quality, and provides posterior probabilities of matches to a known database of target genomes. Importantly, our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database. Furthermore, our approach can accurately discriminate between very closely related strains of the same species with very little coverage of the genome and without the need for multiple alignment steps, extensive homology searches, or genome assembly--which are time-consuming and labor-intensive steps. We demonstrate the utility of our approach on genomic data from purified and in silico environmental samples from known bacterial agents impacting human health for accuracy assessment and comparison with other approaches.
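To make the idea of posterior match probabilities for ambiguously mapped reads concrete, here is a minimal sketch of a generic mixture-model EM. It is not Pathoscope's implementation: the abstract does not give the model details, so the function name `reassign_reads`, the input layout, and the assumption that sequence and mapping quality are already folded into one likelihood per read-genome pair are all illustrative.

```python
def reassign_reads(read_scores, n_iter=50):
    """read_scores: list of dicts mapping genome name -> alignment likelihood (> 0).
    Returns (estimated genome proportions, per-read posterior probabilities)."""
    if not read_scores or any(not scores for scores in read_scores):
        raise ValueError("every read needs at least one candidate genome")
    genomes = sorted({g for scores in read_scores for g in scores})
    # Start from equal proportions across all candidate genomes.
    pi = {g: 1.0 / len(genomes) for g in genomes}
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each genome.
        resp = []
        for scores in read_scores:
            weights = {g: pi[g] * q for g, q in scores.items()}
            total = sum(weights.values())
            resp.append({g: w / total for g, w in weights.items()})
        # M-step: each proportion becomes the average posterior across reads.
        pi = {g: sum(r.get(g, 0.0) for r in resp) / len(read_scores)
              for g in genomes}
    return pi, resp

# Toy input: three reads map uniquely to strainA, one maps equally well to both.
reads = [{"strainA": 1.0}, {"strainA": 1.0}, {"strainA": 1.0},
         {"strainA": 0.5, "strainB": 0.5}]
proportions, posteriors = reassign_reads(reads)
```

In this toy input the uniquely mapped reads pull the ambiguous read toward strainA, which is the behavior that lets a method of this kind separate closely related strains without assembly.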